Abstract

Mining large-scale, high-throughput tandem mass spectrometry data sets
is an important problem in mass spectrometry based protein
identification. One of the fundamental problems in large-scale mining
of spectra is to design appropriate metrics and algorithms that avoid
all-pairs comparisons of spectra. In this paper, we present a general
framework based on vector spaces to avoid pairwise comparisons. We
first robustly embed spectra in a high dimensional space in a novel
fashion and then apply fast approximate near neighbor algorithms for
tasks such as constructing filters for database search, indexing and
similarity searching. We formally prove that our embedding has low
distortion compared to the cosine similarity, and, along with locality
sensitive hashing (LSH), we design filters for database search that
can filter out more than 99.152% of peptides (118 times fewer) while
missing at most 0.29% of the correct sequences. We then show how our
framework can be used in similarity searching, which can then be used
to detect tight clusters or replicates. On average, for a cluster size
of 16 spectra, LSH misses only 1 spectrum and admits only 1 false
spectrum. In addition, our framework in conjunction with dimension
reduction techniques allows us to visualize large datasets in 2D
space. Our framework also has the potential to embed and compare
datasets with post-translational modifications (PTMs).



1 Introduction 

Proteomics aims to analyze proteins and peptides ex- 
pressed by the dynamic biological processes within 
cells fl31 fn. Proteins are responsible for many inter 
and intra-cellular activities such as metabolism and 
cell signaling where proteins are often modified after 
translation within cells [13, 19]. In the post-genomic
era, one of the most important problems is to charac- 
terize the proteome, i.e. the set of proteins within an 
organism. 

Tandem mass spectrometry is one of the most 
promising and widely used high throughput tech- 
niques to analyze proteins and peptides lITSi HI. It 
comprises two stages. A protein mixture is enzymatically digested and
separated by HPLC (High Performance Liquid Chromatography) before
being introduced into a mass spectrometer through a capillary. Then
the peptides get ionized and their precursor ion masses, or
mass/charge ratios, are measured. This is the MS1 stage. The peaks (or
ionized peptides) from the MS1 stage are selected and further
fragmented in a second stage using techniques such as Collision
Induced Dissociation (CID) to yield the MS2 fragment ions. Ideally,
each peptide gets cleaved into two parts. The N-terminal ion (b-ion)
represents the prefix while the C-terminal ion (y-ion) is the suffix.
This stage is also known as the tandem MS or the MS/MS stage. For more
details beyond this oversimplified description, the reader is directed
to the survey [1].

There are two main approaches to analyzing tan- 
dem mass spectra data. First, and the most widely 
used, is the database search method [T0l fT4ll20l l2l. 
Here, peptides from a sequence database are digested in silico and the
resultant virtual spectra are matched (or scored) against the real
spectra. High-scoring peptides are typically chosen as the peptide
candidates. This method leads to a combinatorial explosion when used
to search for Post Translational Modifications (PTMs) [19]. Second,
the de novo method [4, 3, 12] reconstructs the sequence without the
help of a database. Other approaches combine de novo sequencing and
database search by first generating sequence tags, or subsequences,
and then using these tags [6] as filters for database search with
and without PTMs fTTl . 

The promise of tandem mass spectrometry has led 
research groups to routinely use this method to probe 
the proteomes. A single run of a mass spectrome- 
ter can generate several thousands of spectra, and the 
sheer size as well as the number of real life mass 
spectra datasets is predicted to grow at an unprece- 
dented rate with laboratories operating several spec- 
trometers in parallel, round the clock. Thus, efficient 
mining of these large-scale mass spectra data to ob- 
tain useful clues for biological discovery is a very 
important problem. 

Mining large collections of spectra poses several challenges, some of
which are presented below. 1) Indexing huge databases of mass spectra
is not standardized. Commonly used methods index by precursor ion
mass, but this has two main problems: i) there can be errors in
precursor ion masses, and ii) there may be many spectra (several
thousands of them) that have masses close to each other. 2) It is
difficult to search for similar spectra on a large scale quickly, i.e.
in sublinear time. This is a core function used by several data mining
applications. 3) Clustering large databases of spectra is a daunting
task. Most similarity measures proposed in tandem mass spectrometry
are pairwise. Such pairwise methods lead to an explosion of similarity
calculations, i.e. O(n^2) for a set of n spectra. Thus, a key open
problem is to find methods that avoid the pairwise similarity
calculations. If objects can be transformed into metric spaces,
problems such as similarity searching and clustering become easier.
Thus we need to find methods to robustly embed spectra in metric
spaces. 4) Visualization of large groups of mass spectra is an
important problem, and it can also be used to qualitatively identify
outliers among the huge number of spectra produced.

In this paper, we present a general framework for large-scale mining
of tandem mass spectra. Our main contributions are the following: 1)
We robustly embed spectra into a metric space. 2) We show, both
formally and empirically, that distances using our embedding are as
good as those that use the well known cosine method. 3) We apply a
geometric fast near neighbor search technique, Locality Sensitive
Hashing (LSH) [5], to solve several problems such as fast filters for
database search, similarity searching of mass spectra, and
visualization of large spectral databases. 4) Our embedding in
conjunction with PCA and manifold learning can be used to visualize
large groups of spectra. 5) Our embedding holds promise for comparing
spectra with Post Translational Modifications (PTMs).

Our idea of robustly embedding spectra into vector spaces to mine mass
spectra is novel. Previous work embedded spectra into vector spaces
using vectors of amino acid counts for database search [9]. That work
focused on clustering sequence databases based on these amino acid
counts in order to search for mass spectra, given amino acid counts or
sequence tags. However, getting an accurate estimate of amino acid
composition is itself a hard problem, especially when the quality of
spectra is not high. In contrast, our method embeds ion fragments of
spectra directly into a vector space and avoids estimating higher
level features such as amino acid composition. Our scheme is also more
general: using a single embedding, we can either compare spectra with
each other or compare spectra with peptide sequences by generating
their virtual, or in-silico digested, spectra. In addition, we
demonstrate that our framework can be used in concrete mining
applications. We first use our embedding along with Locality Sensitive
Hashing to speed up database search. We demonstrate that we can filter
out more than 99.152% of the candidates with a false negative rate of
0.29%. The average query time for a spectrum is 0.21 s. Then, we
answer similarity queries and find replicates or tight clusters. For
clusters with an average size of 16 spectra, LSH misses an average of
1 spectrum per cluster while admitting only 1 false spectrum.

To the best of our knowledge, no other work robustly embeds spectra in
metric spaces with provable guarantees and then uses fast approximate
near neighbor techniques to solve mass spectrometry data mining
problems.

2 Methods 

Our approach is to use vector spaces, which have been successful in
numerous data mining applications including web searching. Several
fast mining algorithms become simpler to design in these spaces than
in non-metric spaces, e.g. spaces where the only available measure is
a pairwise similarity measure. Thus, the key problem in this approach
is to robustly embed spectra into a high dimensional metric space and
define appropriate distances. These distances must also be correlated
with the well known cosine similarities. In other words, we desire an
embedding with bounded distortion with respect to the cosine
similarity.

2.1 Embedding Spectra 
Noise Removal 

The Achilles heel of tandem mass spectra analysis is the amount of
noise in the mass spectra. In fact, most peaks (around 80%) cannot be
explained and are called 'noise' peaks. 'Signal' peaks (such as b and
y ions) are useful for interpretation. As a first step, we remove
noise peaks, enriching the signal to noise ratio (SNR).

We use a statistical method to increase SNR. We first find the
intensity distributions of signal and noise peaks in a set of
annotated spectra. For this, we consider a set of good quality
annotated spectra as described in Section 3 and generate the virtual
spectrum $V_p$ for each real spectrum $T_p$ of a peptide p. For the
virtual spectrum generation we consider the following ions: b,
b − H2O, b − NH3, y, y − H2O, y − NH3. Then we divide the mass range
of $V_p$ into k = 10 sections. For each section, and for each real
peak, we consider its intensity rank, i.e. the most intense peak ranks
first, and so on. We divide the peaks of $T_p$ into two sets $S_p$ and
$N_p$. $S_p$ contains all those peaks, and their intensity ranks,
which have a match in the virtual spectrum $V_p$; $N_p$ contains the
remaining peaks. Thus, for each region, we obtain a distribution of
signal and noise intensity ranks, as shown in Figure 1.

Figure 1: Signal and Noise distributions of peak intensities in
different regions of spectra (from the training set).

We define the metric SNR of a peak $(mz_j, I_j)$ as follows:

$$\mathrm{SNR}(j) = \frac{P[\mathrm{rank}(j) \mid (mz_j, I_j) \in S_p]}{P[\mathrm{rank}(j) \mid (mz_j, I_j) \in N_p]}$$

If the SNR is large, the peak is likely to be a useful peak; otherwise
it is a noise peak. From Figure 1 we can conclude that the SNR is very
poor at the ends of the spectra, i.e. in the low mass and high mass
regions. This statistical observation reinforces the mass spectrometry
folklore that the middle region is the most suitable for finding
signal peaks.
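
The likelihood ratio above can be estimated from rank histograms. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the Laplace smoothing, the `max_rank` cutoff and the convention that rank 0 is the most intense peak are all introduced here for illustration. In practice one pair of distributions would be estimated per mass region.

```python
from collections import Counter


def rank_distribution(ranks, max_rank=20, smoothing=1.0):
    """Estimate P[rank] from observed intensity ranks.

    Assumption of this sketch: rank 0 is the most intense peak, and
    Laplace smoothing keeps every probability non-zero so that the
    ratio computed in snr() is always defined.
    """
    counts = Counter(r for r in ranks if 0 <= r < max_rank)
    total = sum(counts.values()) + smoothing * max_rank
    return [(counts[r] + smoothing) / total for r in range(max_rank)]


def snr(rank, signal_dist, noise_dist):
    """Likelihood ratio P[rank | signal] / P[rank | noise] for one peak."""
    return signal_dist[rank] / noise_dist[rank]


# Toy training data: signal peaks tend to be intense (low rank),
# noise peaks tend to be weak (high rank).
signal = rank_distribution([0, 0, 1, 1, 2, 3])
noise = rank_distribution([10, 12, 15, 15, 18, 19])

# Keep only the ranks whose likelihood ratio favours "signal".
keep = [r for r in range(20) if snr(r, signal, noise) > 1.0]
```

A threshold of 1.0 is only one possible cutoff; a stricter or looser value trades retained signal peaks against retained noise.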



Figure 2: (i) Embedding spectra in an n-dimensional cube, (ii) Using a
2-dimensional example to illustrate the correlation between the
Euclidean distance and the well known cosine similarity.

Features and Distances 

There are several possible ways to embed tandem mass spectra into a
vector space that support the most common operation of comparing two
spectra and finding similarities. For example, the cosine similarity
measure [10] and its variants have been very popular in recent papers.
Unfortunately, the cosine measure does not yield a metric embedding
because the triangle inequality is violated. The cosine similarity
also implies algorithms that consider pairs of spectra. Clearly such
algorithms are difficult to scale due to the O(n^2) number of
similarity calculations.

For metric embeddings, the design space is quite 
large. A simple idea is to directly bin the peaks and 
use the intensities to form a vector space. However 
spectra from different datasets have different inten- 
sities and we would like to have a single embed- 
ding that could potentially integrate multiple spectral 
databases. 

We first clean spectra as mentioned in the previous subsection. Then
we divide the entire mass range (from 0 to some maximum) into discrete
intervals of 2 Da. For each interval of 2 Da, a bit is set to 1 if the
cleaned spectrum contains a peak in that interval; otherwise it is 0.
This embeds each spectrum onto a vertex of an n-dimensional hypercube.
A 3D version is shown in Figure 2(i). Our feature vectors are defined
to be the unit vectors in the direction of the corresponding vertices
of the n-dimensional hypercube. Thus the space of our embedding is an
n-dimensional unit hypersphere.
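
The binning step can be sketched in a few lines. This is an illustrative sketch only: the function name `embed` and the `max_mz` default of 2000 are assumptions made here, since the text only says the range runs up to "some maximum"; the 2 Da bin width is taken from the text.

```python
import math


def embed(peak_mzs, max_mz=2000.0, bin_width=2.0):
    """Map cleaned peak m/z values to a unit vector on the hypercube.

    max_mz is a hypothetical upper limit chosen for this sketch.
    """
    n_bins = int(math.ceil(max_mz / bin_width))
    bits = [0] * n_bins
    for mz in peak_mzs:
        if 0.0 <= mz < max_mz:
            bits[int(mz // bin_width)] = 1  # one bit per 2 Da interval
    k = sum(bits)
    if k == 0:
        return [0.0] * n_bins  # an empty spectrum has no direction
    norm = math.sqrt(k)
    # Unit vector in the direction of the hypercube vertex.
    return [b / norm for b in bits]


# Two peaks fall into the same 2 Da bin, so only two bits are set.
vec = embed([100.5, 101.0, 500.2])
set_bins = [i for i, v in enumerate(vec) if v > 0.0]
```

Because intensities are discarded after peak selection, spectra from datasets with different intensity scales land in the same space.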

We define the spectral distance between spectra x, y as $\|x - y\|$.
If the angle between two similar spectra x, y is $\theta$,
$\cos\theta$ will be close to 1, or $1 - \cos\theta$ will be very
small. Since x and y are unit vectors, their Euclidean distance will
also be small: $D(x, y)^2 = 2(1 - \cos\theta)$, where D is the
Euclidean distance. Thus, for small angles, $1 - \cos\theta$ and
$D(x, y)$ are small together. It is easy to show that as n, the number
of dimensions, increases, the minimum angle for pairs of very similar
spectra x, y becomes smaller. Thus, instead of calculating
$1 - \cos\theta$, we calculate $D(x, y)$. The natural question that
arises is the distortion of our embedding. We will now show that it is
bounded in theory, and we will later show that the accuracy is
empirically quite high in comparison with the cosine similarity.

We prove some properties of the embeddings. It is 
easy to show the following theorem: 

Theorem 2.1. The embedding discussed above de- 
fines a metric space. 

Proof. The proof is very simple. To show that our embedding defines a
metric space, we need to prove three things: 1) $\|x - y\| = 0$ iff
$x = y$, 2) $\|x - y\| = \|y - x\|$, and 3) the distance measure obeys
the triangle inequality. These properties are trivial to prove in our
case as our embedding uses Euclidean distances. □

We then show that the maximum Euclidean distance is bounded by
$\sqrt{2}$.

Lemma 2.1. The distance between the feature vectors of any two mass
spectra is bounded above by $\sqrt{2}$.

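Proof. Suppose there are two spectra x and y. We shall use the names
of the spectra and their feature vectors interchangeably. According to
our scheme we first filter the noisy peaks and generate the binary
vector after binning. Now assume x has k bits set to 1 and y has k'
bits set to 1. Also assume that c of the bits set to 1 are common to
both. Then, before normalization, $\|x\| = \sqrt{k}$ and
$\|y\| = \sqrt{k'}$. Since c bits are common, the number of dissimilar
bits between x and y is $(k - c) + (k' - c)$. After normalizing to
unit vectors, the dissimilar bits contribute $(k - c)/k + (k' - c)/k'$
to the squared distance and the common bits contribute
$c\,(1/\sqrt{k} - 1/\sqrt{k'})^2$. We have

$$\|x - y\|^2 = 2 - \frac{2c}{\sqrt{k k'}} \le 2,$$

so $\|x - y\| \le \sqrt{2}$, with equality when c = 0. □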

Next we show that our embedding has bounded 
distortion when we compare with the well known co- 
sine similarity. We have the following theorem: 

Theorem 2.2. If $\theta$ is the angle made by the feature
vectors of spectra x, y, and the number of ones in 
each of the vectors after binning is the same we must 
have < < Or in other words, the 

distortion between our Euclidean embedding and the 
cosine similarity is bounded. 

Proof. As in the previous lemma,
$\|x - y\| = \sqrt{2 - 2c/\sqrt{k k'}}$. Now the cosine of the angle
$\theta$ between x, y can be written as
$\cos\theta = c/\sqrt{k k'}$. Assume $k = k'$ and note that
$0 < c/k < 1$. Thus, we must have

$$\|x - y\|^2 = 2\left(1 - \frac{c}{k}\right) = 2(1 - \cos\theta).$$

We note that since $0 < c/k < 1$, we must also have
$0 < 1 - c/k < 1$ and the theorem follows. □

Thus our embedding will perform almost as well as the standard cosine
measure. We show in the next section that this is indeed the case
empirically. Also, since the points are in a Euclidean space, we can
use elegant geometric techniques that yield fast approximate
algorithms for mining the data.
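
The link between the Euclidean distance and the cosine similarity can be checked numerically. The sketch below is illustrative: the helper names are hypothetical and the bin sets are toy data. For unit vectors the identity D^2 = 2(1 − cos θ) holds, so the two measures order pairs of spectra in the same way.

```python
import math


def unit_vector(bins, n_bins):
    """Unit vector for the hypercube vertex whose set bits are `bins`.

    Assumes `bins` is a non-empty set of bin indices.
    """
    scale = 1.0 / math.sqrt(len(bins))
    return [scale if i in bins else 0.0 for i in range(n_bins)]


def compare(bins_x, bins_y, n_bins):
    """Return (Euclidean distance, cosine similarity) of two embeddings."""
    x = unit_vector(bins_x, n_bins)
    y = unit_vector(bins_y, n_bins)
    distance = math.dist(x, y)
    cosine = sum(a * b for a, b in zip(x, y))
    return distance, cosine


# k = k' = 4 set bits, c = 2 of them shared.
d, cos_theta = compare({1, 2, 3, 4}, {3, 4, 5, 6}, n_bins=8)

# No shared bits: the distance reaches its maximum.
d_max, cos_zero = compare({0, 1}, {2, 3}, n_bins=4)
```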

2.2 Similarity Searching 

The ability to calculate distances as opposed to cosines is an
important feature of our framework. We can now apply near neighbor
algorithms to answer queries quickly but approximately. The basic
query primitive we use is the following:

Primitive 1: Given a spectrum x and a set of spectra S, we want to
find all the spectra $S_r \subseteq S$ that are similar to x, i.e. a
spectrum $y \in S_r$ iff $D(x, y) < r_q$, where D is the Euclidean
distance and $r_q$ is a query radius.

A very simple approach would be to do a linear scan of the database
and output every spectrum y such that $D(x, y) < r_q$. This takes O(n)
time. However, if S becomes very large and so does the number of
queries, say O(n), then we have an O(n^2) algorithm. This is clearly
unacceptable for our problem. Thus, we desire methods that answer near
neighbor queries in sub-linear time. For this we are willing to trade
off some accuracy for speed. Several sub-linear near neighbor methods
exist, but we leverage Locality Sensitive Hashing [5] since, unlike
others, it offers bounded error guarantees and is also easy to
implement. We briefly present the idea below.

Locality Sensitive Hashing 

The basic idea, built on random projections, is a class of hash
functions that are locality sensitive, i.e. if two points (p, q) are
close, they have a small $\|p - q\|$ and they will hash to the same
value with high probability. If they are far apart, they should
collide with small probability.

Definition 1: A family $H = \{h : S \to U\}$ is called
locality-sensitive if, for any point q, the function

$$p(t) = \Pr_H[h(q) = h(v) : \|q - v\| = t]$$

is strictly decreasing in t. That is, the probability of collision of
points q and v decreases with the distance between them.

Definition 2: A family $H = \{h : S \to U\}$ is called
$(r_1, r_2, p_1, p_2)$-sensitive for distribution D if for any
$v, q \in S$, we have

• if $v \in B(q, r_1)$ then $\Pr[h(q) = h(v)] > p_1$

• if $v \notin B(q, r_2)$ then $\Pr[h(q) = h(v)] < p_2$

Here B(q, r) represents a ball around point q with radius r. Thus a
good family of hash functions will try to amplify the gap between
$p_1$ and $p_2$.

Indyk et al. [5] showed that s-stable distributions can be used to
construct such families of locality sensitive hash functions. An
s-stable distribution is defined as follows.



Definition 3: A distribution D over R is called s-stable if there
exists s such that for any n real numbers $v_1, \ldots, v_n$ and
i.i.d. variables $X_1, \ldots, X_n$ with distribution D, the random
variable $\sum_i v_i X_i$ has the same distribution as the variable
$(\sum_i |v_i|^s)^{1/s} X$, where X is a random variable with
distribution D.

Consider a random vector a of n dimensions whose entries are drawn
independently from an s-stable distribution. For any two n-dimensional
vectors (p, q), the distance between their projections
$(a \cdot p - a \cdot q)$ is distributed as $\|p - q\|_s X$, where X
is a random variable with an s-stable distribution. We chop the real
line into equal width segments of appropriate size and assign hash
values to vectors based on which segment they project onto. The above
can be shown to be locality preserving.

There are two parameters to tune in LSH. Given a family H of hash
functions as defined above, the LSH algorithm chooses k of them and
concatenates them to amplify the gap between $p_1$ and $p_2$. Thus,
for a point v, $g(v) = (h_1(v), \ldots, h_k(v))$. Also, L such groups
of hash functions, $g_1, \ldots, g_L$, are chosen independently and
uniformly at random to reduce the error. During pre-processing, each
point v is hashed by each of the L functions and stored in the bucket
given by each $g_i(v)$. For any query point q, all the buckets
$g_1(q), \ldots, g_L(q)$ are searched. For each point x in the
buckets, if the distance between q and x is within the query distance,
we output it as a near neighbor. Thus, the parameters k and L are
crucial. It has been shown [5] that $k = \log_{1/p_2} n$ and
$L = n^{\rho}$, where $\rho = \ln(1/p_1) / \ln(1/p_2)$, ensures
locality sensitive properties. In Ref. [5], the authors consider $L_2$
spaces and bound $\rho$ above empirically by $1/c$, c being the
approximation guarantee, i.e. for a given radius R, the algorithm
returns points whose distance is within $c \times R$. The time
complexity of LSH has been shown to be $O(d\, n^{\rho} \log n)$, where
d is the number of dimensions and $\rho$ is as defined above. Thus, if
we desire a coarse level of approximation, LSH can guarantee
sub-linear run times for geometric queries.
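
The indexing and query procedure can be sketched as follows. This is a minimal illustration rather than the implementation used in the paper: the class name `L2LSH`, the default values of k, L and the segment width w, and the use of Gaussian (2-stable) projections with a random offset are assumptions of this sketch.

```python
import math
import random
from collections import defaultdict


class L2LSH:
    """Toy LSH index using Gaussian (2-stable) random projections."""

    def __init__(self, dim, k=4, L=8, w=1.0, seed=0):
        rng = random.Random(seed)
        self.w = w            # width of each segment of the real line
        self.points = []
        self.tables = []      # L tables, each with k concatenated hashes
        for _ in range(L):
            funcs = [
                ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
                for _ in range(k)
            ]
            self.tables.append((funcs, defaultdict(list)))

    def _key(self, funcs, v):
        # g(v) = (h_1(v), ..., h_k(v)), h(v) = floor((a . v + b) / w)
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
            for a, b in funcs
        )

    def add(self, v):
        idx = len(self.points)
        self.points.append(list(v))
        for funcs, buckets in self.tables:
            buckets[self._key(funcs, v)].append(idx)
        return idx

    def query(self, q, radius):
        # Collect candidates from the L buckets, then verify distances.
        candidates = set()
        for funcs, buckets in self.tables:
            candidates.update(buckets.get(self._key(funcs, q), ()))
        return sorted(i for i in candidates
                      if math.dist(q, self.points[i]) <= radius)


index = L2LSH(dim=3, seed=1)
a = index.add([1.0, 0.0, 0.0])
b = index.add([0.0, 1.0, 0.0])
hits = index.query([1.0, 0.0, 0.0], radius=0.5)
```

The final distance check removes false positives from colliding buckets; points that never collide with the query in any of the L tables are the misses that the choice of k and L controls.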

2.3 Similarity Searching 

Using our embedding and a fast near neighbor algorithm, we can find
spectra similar to a given query spectrum. The key is to use the
correct query radius r. We show in the next section how this can be
chosen. If the radius is too high, the query might return a very large
set of spectra, and if the radius is too low, it might not yield any
neighbor.

If an appropriate query radius is chosen, it is easy to find tight
clusters using the following heuristic, ANN-cluster:

1) Embed spectra into a Euclidean space and form the set S.
2) Hash the feature vectors, S, using LSH.
3) Choose some k random spectra and find their near neighbors (tight
   clusters). For each random spectrum, add its neighbors to a set C.
4) S = S − C.
5) Go to step 3 until S is empty.
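
The ANN-cluster heuristic can be sketched as a short loop. This is an illustrative sketch under assumptions: the function and parameter names are hypothetical, and a brute-force neighbor search stands in for LSH so that the example is self-contained; any near neighbor routine with the same signature could be passed as `neighbors`.

```python
import math
import random


def ann_cluster(vectors, radius, neighbors=None, batch=1, seed=0):
    """Group vectors into tight clusters around randomly chosen seeds."""
    rng = random.Random(seed)
    if neighbors is None:
        # Brute-force stand-in for an LSH query (assumption of this sketch).
        def neighbors(i, pool):
            return [j for j in pool
                    if math.dist(vectors[i], vectors[j]) <= radius]

    remaining = set(range(len(vectors)))   # the set S
    clusters = []
    while remaining:                       # step 5: repeat until S is empty
        seeds = rng.sample(sorted(remaining), min(batch, len(remaining)))
        for s in seeds:                    # step 3: random seed spectra
            if s not in remaining:
                continue
            members = set(neighbors(s, remaining)) | {s}   # the set C
            clusters.append(sorted(members))
            remaining -= members           # step 4: S = S - C
    return clusters


points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
clusters = sorted(ann_cluster(points, radius=0.5))
```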

Another immediate application of our framework is finding outliers. To
check for an outlier, we need to determine whether a spectrum has at
most 1 or 2 neighbors. If the neighbors remain unchanged even on
increasing the query radius by $\delta$, the spectrum is indeed an
outlier. Since near neighbor queries take sub-linear time with LSH,
outliers can be detected in sub-quadratic time.
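
The outlier test can be written directly from this description. The sketch below is illustrative only: the names are hypothetical, the neighbor search is brute force rather than LSH, and the default limit of 2 neighbors follows the "at most 1 or 2" wording of the text.

```python
import math


def neighbors_within(i, vectors, radius):
    """Indices of all other vectors within `radius` of vector i."""
    return {j for j, v in enumerate(vectors)
            if j != i and math.dist(vectors[i], v) <= radius}


def is_outlier(i, vectors, radius, delta, max_neighbors=2):
    """True if i has few neighbors and widening the radius adds none."""
    near = neighbors_within(i, vectors, radius)
    if len(near) > max_neighbors:
        return False
    return neighbors_within(i, vectors, radius + delta) == near


data = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.1, 0.1], [9.0, 9.0]]
far_point = is_outlier(4, data, radius=1.0, delta=0.5)
dense_point = is_outlier(0, data, radius=1.0, delta=0.5)
```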

2.4 Speeding up Database Search

In this section, we discuss a sample application using our mining
framework. Database search is the primary tandem mass spectrometry
data mining application. Given a query spectrum x and a mass spectra
database MSDB (described in Section 3), the problem is to find out
which peptide $p \in$ MSDB corresponds to x.

Database search is a well explored topic, see fl^ 
for example. Most tools index the MSDB by the peptide mass. Then for a
spectrum x, the precursor mass $m_x$ is found. Then all the spectra in
the set $S_p = \{y \mid y \in \mathrm{MSDB}\}$ such that
$|m_y - m_x| < \delta$ are compared with x, where $\delta$ is some
pre-defined mass tolerance. Each comparison between the query spectrum
and a candidate spectrum takes a while, depending on the scoring
function used. We reduce the size of $S_p$ by filtering out the
unrelated spectra, speeding up the search. We aim to retain the true
peptide for a given spectrum while discarding most of the unrelated
peptides.

We generate the virtual spectra from each peptide sequence in the
database, and then embed those virtual spectra in the Euclidean space,
as described above. Then for filtering, we choose an appropriate
threshold radius r and query the LSH algorithm to yield all the
candidates within a ball of radius r. The ratio of the total number of
peptides within the mass tolerance to the number of candidates
returned is our speedup.
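
The filter and the speedup ratio can be sketched as below. This is a hedged illustration: the record layout (dictionaries with `seq`, `mass` and `vec` keys), the function name and the toy values are assumptions, and a linear scan stands in for the LSH query.

```python
import math


def filter_candidates(query_vec, query_mass, peptides, mass_tol, radius):
    """Return (kept candidates, speedup) for one query spectrum."""
    # Candidates that conventional precursor-mass indexing would score.
    in_window = [p for p in peptides
                 if abs(p["mass"] - query_mass) < mass_tol]
    # Candidates that survive the distance filter.
    kept = [p for p in in_window
            if math.dist(query_vec, p["vec"]) <= radius]
    speedup = len(in_window) / len(kept) if kept else float("inf")
    return kept, speedup


peptides = [
    {"seq": "A", "mass": 1000.2, "vec": [1.0, 0.0]},
    {"seq": "B", "mass": 1000.9, "vec": [0.0, 1.0]},
    {"seq": "C", "mass": 1001.5, "vec": [0.6, 0.8]},
    {"seq": "D", "mass": 1200.0, "vec": [1.0, 0.0]},
]
kept, speedup = filter_candidates([1.0, 0.0], 1000.0, peptides,
                                  mass_tol=2.0, radius=1.0)
```

Only the surviving candidates need to be passed to the detailed scoring function, which is where the saving comes from.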

2.5 Visualization and Dimension Reduction 

As mentioned earlier, visualizing thousands of spectra is a very hard
problem. We are not aware of any previous work that allows us to
visualize large mass spectrometry data sets. Our embedding followed by
dimension reduction allows us to view spectra in a two or three
dimensional space. As a bonus, it allows us to qualitatively identify
outliers in the data set.

Once we have embedded the spectra in a Euclidean space, we can use
some of the common techniques to visualize high dimensional data by
dimensionality reduction. The most common linear method is PCA [16].
Recently, several non-linear methods for dimensionality reduction have
been discovered, the majority of them exploiting the low dimensional
manifold structure of the dataset. In this paper, we leverage one of
these techniques, the Isomap method, to project the high dimensional
data onto a 2D plane. Due to lack of space we do not provide a
description of the method.

3 Experimental Results 

In this section, we describe the empirical evaluation of our embedding
followed by some representative data mining tasks. Unless otherwise
stated, we use the dataset from Keller et al. [11]. For calculating
statistics, we used a random 80% of the 1618 annotated spectra from
this dataset. The statistics were independent of the exact choice of
spectra. Note that our techniques are unsupervised except for the
selection of query radii. Of these, 1014 spectra came from proteins
digested with trypsin and were used for the database search filter.

For database search filters, we use a non-redundant protein sequence
database called MSDB, which is maintained by Imperial College, London.
The release (20042301) has 1,454,651 protein sequences (around 550M
amino acids) from multiple organisms. Peptide sequences were generated
by in-silico digestion and the list of peptides was grouped into
different files by precursor ion mass, one file per 10 Da.
3.1 Empirical evaluation of the embedding



In this section, we critically analyze our embedding and different
distance metrics. For these analyses, we chose a set of 1014 curated
spectra of proteins digested with trypsin and reported by Keller et
al. We then cleaned the spectra and picked the peaks most likely to be
signal peaks. Then we constructed the binary bit vector as discussed
earlier. For this set of spectra, we knew that there were 100-odd
clusters with 15 spectra per cluster on average. We calculate the
pairwise distances between spectra within the same cluster and we term
this the similar set SS. We then choose a representative from each
cluster at random, calculate the distances between the
representatives, and call this set the dissimilar set DS. Since both
sets have a similar number of pairwise distances, we plot the
frequency distributions of DS and SS in Figure 3 for three metrics:
Hamming, 1-cosine and Euclidean. It is very clear that Hamming is
unsuitable as a metric as it has low discriminability. As expected,
1-cosine and Euclidean look almost identical, with low overlap between
the sets DS and SS. Also note that the cosine metric used here is not
exactly the same as that used by others: we do not take the
intensities into consideration after we have selected the peaks.
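
The construction of the two sets can be sketched as follows. This is an illustrative sketch with hypothetical names and toy 2D points in place of embedded spectra; only the Euclidean distance is computed here.

```python
import itertools
import math
import random


def similar_and_dissimilar_sets(clusters, seed=0):
    """Build SS (intra-cluster) and DS (inter-cluster) distance sets.

    `clusters` is a list of clusters, each a list of feature vectors.
    """
    rng = random.Random(seed)
    # SS: all pairwise distances within each cluster.
    ss = [math.dist(a, b)
          for members in clusters
          for a, b in itertools.combinations(members, 2)]
    # DS: distances between one random representative per cluster.
    reps = [rng.choice(members) for members in clusters]
    ds = [math.dist(a, b) for a, b in itertools.combinations(reps, 2)]
    return ss, ds


toy_clusters = [
    [[0.0, 0.0], [0.0, 1.0]],
    [[10.0, 0.0], [10.0, 1.0]],
]
ss, ds = similar_and_dissimilar_sets(toy_clusters)
```

A histogram of `ss` against `ds` then shows how well a given distance separates replicate spectra from unrelated ones.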









Figure 3: Distribution of scores with real spec- 
tra using different metrics (hamming, 1-cosine, eu- 
clidean). The dotted curve plots the inter-cluster dis- 
tances while the solid line represents the intra-cluster 
distribution. 



Now, we consider the database of tryptic peptides, MSDB. For each
peptide, we generate its virtual spectrum and then construct the
feature vector as above. For each real spectrum, we calculate the
distance to the correct virtual spectrum and we call this set of
scores SS. Then we choose, from the database, 100 random peptides
having almost the same mass as the precursor ion mass of the given
spectrum, and add their distances to the dissimilar set DS. We then
plot the probability distributions of SS and DS in Figure 4. Again we
can see the clear separation between the two sets of distances (with
< 1% overlap). This indicates that Euclidean distance in our embedded
space is a good metric for designing filters for database search. Note
the sharp impulse at 1.414, corresponding to distances between real
spectra and completely dissimilar peptides within a mass tolerance of
2 Da, providing empirical evidence for Lemma 2.1.



3.2 Post Translational Modifications 

Now we present some very preliminary results on a set of spectra from
the PFTau protein. We picked 8 good quality spectra with known
phosphorylations. We wanted to study whether our metric can help
design filters that might work for PTM studies. From Figure 5 we note
that distances between spectra and their PTM variants have a higher
likelihood of being classified as similar than dissimilar. This is
evident when they are compared with the distance distributions in
Section 3.1.

3.3 Query processing using LSH 

In this section, we quantify the accuracy of our framework for
similarity searching and clustering. As mentioned earlier, we use LSH
to answer queries with bounded errors in expected sub-linear time. We
first indexed the 1014 spectra using our embedding followed by LSH.
For each of the 1014 spectra, we queried LSH with a radius r, which we
varied. We plot the number of missed spectra that were actually
present in the cluster of the query spectrum in Figure 6 and the
number of false positives in Figure 7. As we increased the radius, the
number of misses decreased. This is expected: as the radius of the
query ball increases, so does the number of data points that can be
considered. As expected, the number of false positives also increased
as r increased. This indirectly demonstrates the accuracy of any
clustering algorithm based on LSH. We miss an average of 1 spectrum
within each cluster while admitting only 1 false spectrum.

Figure 4: Distribution of distances between real and virtual spectra
using different metrics. The dotted curve represents the distances
between real spectra and virtual spectra from 100 different peptides
of similar precursor masses. The sequences are from MSDB. The other
curve shows the distribution of distances between spectra and the
virtual spectra from the true peptides.

Figure 5: Some sample distances between spectra and their PTM
variants. Note the low scores between the pairs. Distances between
spectra of different peptides had a mean μ = 1.388 and a standard
deviation σ = 0.017.

Figure 6: The average number of spectra that are present in the
cluster containing the query spectrum but are missed by LSH.

At r = 1.0 to 1.1 the false positives are not very high. This might be
important when we want to query for similar spectra in order to
generate consensus spectra. In such situations, it might be fine to
miss some bad quality spectra (distances to bad quality spectra are
usually higher). Also, consider situations where we would like to
coarsely partition the data set (e.g. for clustering). Then, we can
afford to have a few false positives but we cannot miss any true
positives. In such cases we increase the radius to at most 1.25, as
the likelihood of an intra-cluster distance being greater than 1.25 is
low (see Figure 3).

3.4 Speeding up Database Search 

To test the efficacy of our framework on speeding
up database search, we first use our metric to filter
out candidate spectra. Since our distance calculation
is much faster than the detailed scoring of two
spectra, we define speedup as the ratio of the total
number of candidate peptides within a mass tolerance
of 2 daltons to the total number of peptides that are
within a distance of A of the query spectrum and
within the same mass tolerance. We then increase A
and calculate the number of true peptides missed in
this filtering process. In Figure 8 we plot the speedup
on a logarithmic scale against the miss percentage.
This gives us the speedup (or quality of filtering)
versus accuracy tradeoff of using our framework. For
a 2 dalton range the number of peptides is around
100K-200K. For a set of around 100K peptides, LSH
takes 0.21 s on average to answer a query. As we see
from Figure 8, we can get an average speedup of 118
if we allow 0.19% misses. This may be reasonable for
many applications. In fact, we found that our errors
were due to low quality spectra in our test dataset.

Figure 7: The average number of spectra that are not
present in the cluster containing the query spectrum
but are reported by LSH (x-axis: LSH query radius)

Figure 8: Filtering of spectra for DBASE search

Figure 9: Dimension Reduction with Isomap; panels
(i) PCA and (ii) Isomap

3.5 Visualization and Dimension Reduction

Consider the training dataset of mass spectra. We
first generate Euclidean feature vectors for each
spectrum. We then apply PCA and plot the first two
components on the x-axis and the y-axis, as shown in
Figure 9(i). The clusters are visible and so are the
outliers, but the visualization is coarse grained.

We then use Isomap on the same dataset. Recall
that in Isomap, one first needs to calculate the
near neighbors. Thus in our plot, we also show the
near neighbor graph along with the projected points,
as shown in Figure 9(ii). The cluster structure seems
to be qualitatively clearer than with PCA.
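The PCA projection described in Section 3.5 can be sketched in a few lines. This is a minimal illustration, not the code used for our figures: it assumes the feature vectors are plain Python lists, and it uses power iteration with deflation in place of a library PCA routine. All function names are hypothetical. The Isomap step is not shown.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def matvec(mat, v):
    return [sum(m * x for m, x in zip(row, v)) for row in mat]


def top_eigenvector(mat, iters=500, seed=0):
    """Power iteration on a symmetric positive semi-definite matrix."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(len(mat))]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = matvec(mat, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # zero matrix: keep the current unit vector
        v = [x / norm for x in w]
    eigval = sum(a * b for a, b in zip(v, matvec(mat, v)))
    return v, eigval


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    x = center(rows)
    cov = covariance(x)
    d = len(cov)
    v1, lam1 = top_eigenvector(cov, seed=0)
    # Deflation: remove the first component so that power iteration
    # converges to the second one.
    deflated = [[cov[i][j] - lam1 * v1[i] * v1[j] for j in range(d)]
                for i in range(d)]
    v2, _ = top_eigenvector(deflated, seed=1)
    return [(sum(a * b for a, b in zip(r, v1)),
             sum(a * b for a, b in zip(r, v2))) for r in x]
```

Calling `pca_2d` on a list of equal-length feature vectors returns one (x, y) pair per spectrum, ready for a scatter plot. For realistic dimensions a numerical library would be the sensible choice. The pure-Python version is used here only to keep the example self-contained.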



4 Discussion 

The results in the previous section look promising. 
The clear separation between the DS and SS sets
during the metrics comparison was initially a surprise
to us. One of the reasons for the good result is the
quality of the dataset. We first wanted to validate our
simple assumptions and claims on a dataset which
had reliable interpretations. Since we first transform
the spectra into binary bit strings, we avoid the
huge variations of density in spectra. The
signal-to-noise ratio pilot study also underscored the
fact that we need to study spectra by segmenting them.
Note that one reason why we obtained clear separations
between the DS and SS sets in all cases with our
embedding is that we avoided using the precursor ion
mass as a feature. Even though it is fine to use the
precursor mass as a coarser-grained filter, it would
lead to less robust embeddings, as such masses are
prone to errors due to isotope effects. Our theoretical
results would also no longer hold.
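To make the bit-string idea concrete, the following sketch bins peak positions into a 0/1 vector and compares two such vectors with the Euclidean distance. It is a simplified illustration rather than the exact embedding analyzed earlier in the paper: the bin width, the maximum mass and the function names are assumed values chosen for the example, and any peak cleaning is omitted.

```python
import math


def embed(peaks_mz, bin_width=1.0, max_mz=2000.0):
    """Map a list of peak m/z values to a 0/1 vector, one bit per mass bin.

    bin_width and max_mz are illustrative, not the settings of the paper.
    Peak intensities and the precursor mass are deliberately ignored.
    """
    n_bins = int(math.ceil(max_mz / bin_width))
    bits = [0] * n_bins
    for mz in peaks_mz:
        if 0.0 <= mz < max_mz:
            # min() guards against floating-point rounding at the top edge.
            bits[min(int(mz // bin_width), n_bins - 1)] = 1
    return bits


def euclidean(u, v):
    """Euclidean distance; on 0/1 vectors this is sqrt(Hamming distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two made-up spectra sharing two of their three peaks (to within one bin).
spec_a = embed([100.2, 250.7, 500.1])
spec_b = embed([100.4, 250.9, 700.3])
dist_ab = euclidean(spec_a, spec_b)
```

Because every occupied bin contributes the same single bit, a dense spectrum and a sparse one are compared on the same footing, which is the effect described above.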

For LSH, the speed and the accuracy are quite
satisfactory. However, there are two implementation
issues. Our current indexing is memory bound. This
means we need lots of memory to index millions
of mass spectra. Even though this is possible with
the current 64-bit machines, we need to design
disk-based LSH schemes. We are working on a
large-scale implementation of our framework based on
such techniques. Another issue is the choice of the
number of bins and the mass coverage. Increasing
the number of bins leads to the curse of
dimensionality, which would slow down LSH and reduce
the filtering speedup. If we choose fine-grained bins
with a lower maximum mass, our embedding will
result in a pseudo-metric space as several different
spectra will now satisfy assumption one in Theorem
2.1.
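The memory issue is easy to see in a toy index. The sketch below is not our LSH implementation: it substitutes the classical bit-sampling scheme for Hamming distance, which applies to 0/1 vectors because their squared Euclidean distance equals their Hamming distance. The class name and all parameter values are assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


class BitSamplingLSH:
    """In-memory bit-sampling LSH over 0/1 vectors (illustrative only)."""

    def __init__(self, dim, n_tables=10, bits_per_key=8, seed=0):
        rng = random.Random(seed)
        # Each table hashes a vector by reading a fixed random subset of bits.
        self.positions = [rng.sample(range(dim), bits_per_key)
                          for _ in range(n_tables)]
        self.tables = [defaultdict(list) for _ in range(n_tables)]
        self.vectors = []  # every indexed vector is kept in memory

    @staticmethod
    def _key(vec, positions):
        return tuple(vec[p] for p in positions)

    def add(self, vec):
        """Index a vector and return its integer id."""
        idx = len(self.vectors)
        self.vectors.append(vec)
        for pos, table in zip(self.positions, self.tables):
            table[self._key(vec, pos)].append(idx)
        return idx

    def query(self, vec, radius):
        """Return ids of colliding vectors within Euclidean distance radius."""
        candidates = set()
        for pos, table in zip(self.positions, self.tables):
            candidates.update(table.get(self._key(vec, pos), ()))
        hits = []
        for idx in candidates:
            # Verify each candidate with an exact distance computation.
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(vec, self.vectors[idx])))
            if dist <= radius:
                hits.append(idx)
        return sorted(hits)
```

Every table stores an entry for every indexed vector, so memory grows with the number of tables times the number of spectra. This is why an index of this kind over millions of spectra calls for a disk-based layout. Only vectors that collide with the query in at least one table are ever examined, so a far-away vector is skipped without a distance computation, while a true neighbor that collides in no table is missed.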

5 Conclusions and Future Work 

In this paper, we showed that our embedding with
geometric algorithms provides a good framework for
mining mass spectra. In particular, we have
demonstrated, both theoretically and empirically, that
our embedding coupled with Euclidean distance
performs as well as the well-known cosine similarity
while providing us with the benefits of a metric space
and enabling us to use approximate sub-linear time
near neighbor techniques for data mining. Using this
framework, we showed how we can do similarity
searches and find tight clusters. We also
demonstrated that we can get two orders of magnitude
of filtering for database search. As an aside, we are
also able to visualize large datasets in two dimensions,
qualitatively identifying the outliers.

This work is the first step in the direction of an
integrated framework for large-scale mining of tandem
mass spectra using simple techniques from embeddings,
vector spaces and computational geometry.
Several directions are being investigated at this point.
The main areas of investigation are 1) better
embeddings that offer better resolution for PTM spectra,
2) faster external database searching algorithms that
use embeddings, 3) more effective blind PTM searching
using embeddings, 4) large-scale clustering and
visualization of mass spectrometry data and 5)
integrating data from different sources using our
embeddings.

We should note that several sections of the paper
could be of independent interest. For example, we
need to explore the probabilistic cleaning of mass
spectra in more detail. Our embedding promises
to work across datasets, and this general method can
be used for integrated studies of other biological
datasets, e.g. microarray data sets.